Data loading

# load the package
library("terra")

In the first step, we need to create a list of files (rasters) that we are going to load. To do this, we can use the list.files() function, which takes a path to a folder with files as an argument. In addition, we must indicate what kind of files we want to load (pattern = "\\.TIF$") and return full paths to the files (full.names = TRUE).

# listing files from a directory
files = list.files("dane_cut/", pattern = "\\.TIF$", full.names = TRUE)
files
## [1] "dane_cut/LC08_L2SP_191023_20230605_20230613_02_T1_SR_B1.TIF"
## [2] "dane_cut/LC08_L2SP_191023_20230605_20230613_02_T1_SR_B2.TIF"
## [3] "dane_cut/LC08_L2SP_191023_20230605_20230613_02_T1_SR_B3.TIF"
## [4] "dane_cut/LC08_L2SP_191023_20230605_20230613_02_T1_SR_B4.TIF"
## [5] "dane_cut/LC08_L2SP_191023_20230605_20230613_02_T1_SR_B5.TIF"
## [6] "dane_cut/LC08_L2SP_191023_20230605_20230613_02_T1_SR_B6.TIF"
## [7] "dane_cut/LC08_L2SP_191023_20230605_20230613_02_T1_SR_B7.TIF"

Once we have created a list of files, we can load them using the rast() function from the terra package and then display the metadata.

# load raster data
landsat = rast(files)
landsat # calling the object displays the metadata
## class       : SpatRaster 
## dimensions  : 2360, 2469, 7  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 594915, 668985, 5777865, 5848665  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 33N (EPSG:32633) 
## sources     : LC08_L2SP_191023_20230605_20230613_02_T1_SR_B1.TIF  
##               LC08_L2SP_191023_20230605_20230613_02_T1_SR_B2.TIF  
##               LC08_L2SP_191023_20230605_20230613_02_T1_SR_B3.TIF  
##               ... and 4 more source(s)
## names       : LC08_~SR_B1, LC08_~SR_B2, LC08_~SR_B3, LC08_~SR_B4, LC08_~SR_B5, LC08_~SR_B6, ... 
## min values  :          31,          28,        2739,        2745,        7413,        6857, ... 
## max values  :       36332,       34628,       36618,       32237,       50996,       61014, ...

We can also shorten or rename the spectral bands. Before this, make sure that the bands are loaded in the correct order.

names(landsat) # original names
## [1] "LC08_L2SP_191023_20230605_20230613_02_T1_SR_B1"
## [2] "LC08_L2SP_191023_20230605_20230613_02_T1_SR_B2"
## [3] "LC08_L2SP_191023_20230605_20230613_02_T1_SR_B3"
## [4] "LC08_L2SP_191023_20230605_20230613_02_T1_SR_B4"
## [5] "LC08_L2SP_191023_20230605_20230613_02_T1_SR_B5"
## [6] "LC08_L2SP_191023_20230605_20230613_02_T1_SR_B6"
## [7] "LC08_L2SP_191023_20230605_20230613_02_T1_SR_B7"
names(landsat) = paste0("B", 1:7) # shorten the names
names(landsat) # new names
## [1] "B1" "B2" "B3" "B4" "B5" "B6" "B7"
# or alternatively rename
# names(landsat) = c("Ultra Blue", "Blue", "Green", "Red", "NIR", "SWIR1", "SWIR2")

Loading vector data is done in an analogous way using the vect() function.

# load vector data
poly = vect("Poznan.gpkg")
poly
##  class       : SpatVector 
##  geometry    : polygons 
##  dimensions  : 1, 0  (geometries, attributes)
##  extent      : 326611.5, 391121.8, 477976.1, 536838.4  (xmin, xmax, ymin, ymax)
##  source      : Poznan.gpkg
##  coord. ref. : ETRF2000-PL / CS92 (EPSG:2180)

As we can see from the metadata, raster and vector data have different coordinate reference systems (CRS), which is troublesome. The easiest way is to transform the vector data into a raster’s CRS and we can do it with the project() function and specifying the EPSG code.

poly = project(poly, "EPSG:32633")
crs(poly, describe = TRUE)$code # check EPSG after transformation
## [1] "32633"

Now we can prepare a simple visualization using the near infrared band (NIR; B5) and polygon as an example.

# visualization
plot(landsat[[5]], main = "Poznan commune") # alternatively: plot(landsat[["B5"]])
plot(poly, add = TRUE)

Raster processing

The extent of our analysis area is limited to the Poznan commune, while the satellite scene has a much larger extent. In such a situation, we can crop the rasters, so that their further processing will be faster and the results will take up less space on disk. The crop() function is used to crop the rasters, and we need to specify the raster and vector as arguments.

Note that the rasters are represented as matrices, so the cropping is done to the bounding box. To include only the area of our polygon in the analysis, we need to mask the pixels outside the boundary. This can be done by setting the mask = TRUE argument in the aforementioned function.

landsat = crop(landsat, poly, mask = TRUE)
plot(landsat[[5]], main = "Poznan commune")
plot(poly, add = TRUE)

In the next step, we can easily check the descriptive statistics of our dataset.

summary(landsat)
##        B1              B2              B3              B4       
##  Min.   : 4095   Min.   : 1959   Min.   : 8075   Min.   : 7655  
##  1st Qu.: 7813   1st Qu.: 8037   1st Qu.: 8987   1st Qu.: 8408  
##  Median : 8176   Median : 8429   Median : 9659   Median : 9170  
##  Mean   : 8464   Mean   : 8771   Mean   : 9901   Mean   : 9706  
##  3rd Qu.: 8879   3rd Qu.: 9229   3rd Qu.:10408   3rd Qu.:10499  
##  Max.   :23937   Max.   :25640   Max.   :28745   Max.   :30744  
##  NA's   :42589   NA's   :42588   NA's   :42588   NA's   :42588  
##        B5              B6              B7       
##  Min.   : 7413   Min.   : 7447   Min.   : 7406  
##  1st Qu.:16791   1st Qu.:11743   1st Qu.: 9318  
##  Median :18907   Median :13078   Median :10541  
##  Mean   :18980   Mean   :13783   Mean   :11553  
##  3rd Qu.:21018   3rd Qu.:15265   3rd Qu.:12847  
##  Max.   :33376   Max.   :34485   Max.   :37960  
##  NA's   :42588   NA's   :42588   NA's   :42588

As we can see, the spectral reflectance values are several thousand for each band and are encoded as integers. The spectral reflectance should be in the range from 0 to 1. Therefore, we need to scale our data using the following equation (this only applies to reflectance, not temperature!):

\[x = x \cdot 0.0000275 - 0.2\]

For example, the pixel value in the near infrared (NIR) band is 15000. Using the above formula, we need to multiply this value by 0.0000275 (scale factor) and then subtract 0.2 (offset). As a result, we will get a reflection equal to 0.2125. Note that each product/collection has a different formula and it is necessary to consult the documentation.

We don’t need to apply this formula separately for each band in the loop because the arithmetic operations in the terra package is applied to all bands by default.

landsat = landsat * 2.75e-05 - 0.2
summary(landsat)
##        B1              B2              B3              B4       
##  Min.   :-0.09   Min.   :-0.15   Min.   :0.02    Min.   :0.01   
##  1st Qu.: 0.01   1st Qu.: 0.02   1st Qu.:0.05    1st Qu.:0.03   
##  Median : 0.02   Median : 0.03   Median :0.07    Median :0.05   
##  Mean   : 0.03   Mean   : 0.04   Mean   :0.07    Mean   :0.07   
##  3rd Qu.: 0.04   3rd Qu.: 0.05   3rd Qu.:0.09    3rd Qu.:0.09   
##  Max.   : 0.46   Max.   : 0.51   Max.   :0.59    Max.   :0.65   
##  NA's   :42589   NA's   :42588   NA's   :42588   NA's   :42588  
##        B5              B6              B7       
##  Min.   :0.00    Min.   :0.00    Min.   :0.00   
##  1st Qu.:0.26    1st Qu.:0.12    1st Qu.:0.06   
##  Median :0.32    Median :0.16    Median :0.09   
##  Mean   :0.32    Mean   :0.18    Mean   :0.12   
##  3rd Qu.:0.38    3rd Qu.:0.22    3rd Qu.:0.15   
##  Max.   :0.72    Max.   :0.75    Max.   :0.84   
##  NA's   :42588   NA's   :42588   NA's   :42588

We can still see that some values exceed the range of 0 to 1. These are outliers that are usually associated with incorrect measurement or oversaturation. We can solve this problem in two ways:

  1. Replace these values with missing data (NA).
  2. Trim to minimum and maximum value.

The first way can cause us to lose a large part of the dataset. The second way, on the other hand, can cause skewed results.

# method 1
landsat[landsat < 0] = NA
landsat[landsat > 1] = NA
# method 2
landsat[landsat < 0] = 0
landsat[landsat > 1] = 1

After scaling the values, we can display the RGB composition. In this case, instead of plot() function, use plotRGB() function and define the order of red, green and blue bands. In addition, we need to specify the maximum reflection value for the bands (in our case scale = 1). It often happens that compositions are too dark/bright, then is helpful to apply color stretching using the stretch = "lin" or stretch = "hist" argument.

# plotRGB(landsat, r = 4, g = 3, b = 2, scale = 1)
plotRGB(landsat, r = 4, g = 3, b = 2, scale = 1, stretch = "lin")

Clustering

# load clustering package
library("cluster")

Data for modeling must be prepared in an appropriate way. Classification models most often require a matrix or data frame at the training stage. Raster data can be transformed into a matrix using the values() function (or alternatively coercing by as.matrix()).

mat = values(landsat)
nrow(mat) # print number of rows/pixels
## [1] 4182465